HYBRID SHRINKAGE ESTIMATORS USING PENALTY BASES
FOR THE ORDINAL ONE-WAY LAYOUT 

By Rudolf Beran

University of California, Davis 

This paper constructs improved estimators of the means in the 
Gaussian saturated one-way layout with an ordinal factor. The least 
squares estimator for the mean vector in this saturated model is usu- 
ally inadmissible. The hybrid shrinkage estimators of this paper ex- 
ploit the possibility of slow variation in the dependence of the means 
on the ordered factor levels but do not assume it and respond well 
to faster variation if present. To motivate the development, candi- 
date penalized least squares (PLS) estimators for the mean vector 
of a one-way layout are represented as shrinkage estimators rela- 
tive to the penalty basis for the regression space. This canonical 
representation suggests further classes of candidate estimators for 
the unknown means: monotone shrinkage (MS) estimators or soft- 
thresholding (ST) estimators or, most generally, hybrid shrinkage 
(HS) estimators that combine the preceding two strategies. Adaptation
selects the estimator within a candidate class that minimizes
estimated risk. Under the Gaussian saturated one-way layout model, 
such adaptive estimators minimize risk asymptotically over the class 
of candidate estimators as the number of factor levels tends to infinity. 
Thereby, adaptive HS estimators asymptotically dominate adaptive 
MS and adaptive ST estimators as well as the least squares estimator. 
Local annihilators of polynomials, among them difference operators, 
generate penalty bases suitable for a range of numerical examples. In 
case studies, adaptive HS estimators recover high frequency details 
in the mean vector more reliably than PLS or MS estimators and low 
frequency details more reliably than ST estimators. 

1. Introduction. Consider the one-way layout of ANOVA. A single factor
that influences the observed responses has p distinct levels $\{s_i : 1 \le i \le p\}$.
These factor levels can be either nominal (i.e., pure labels that bear no
ordering information) or ordinal (i.e., real numbers whose order and spacing
carry information). In the case of an ordinal factor, we will suppose that
the factor levels have been ordered from smallest to largest. At level $s_i$, we
observe measurements $\{y_{ij} : 1 \le j \le n_i\}$. The saturated Gaussian model for
the one-way layout asserts that the observations $\{y_{ij}\}$ satisfy

(1.1) $y_{ij} = \mu_i + e_{ij}, \quad 1 \le i \le p, \; 1 \le j \le n_i.$

Here the errors $e = \{e_{ij}\}$ are independent, identically distributed, each having
an $N(0, \sigma^2)$ distribution, and the means $\{\mu_i\}$ are unknown real numbers
subject to no restrictions. That the means depend on the respective factor
levels can be expressed formally by

(1.2) $\mu_i = m(s_i), \quad 1 \le i \le p.$

In equation (1.2), the function m is real-valued, unknown, and is subject to
no restrictions.

At first glance, the saturated one-way layout model expressed by equa- 
tions (1.1) and (1.2) resembles a model for curve estimation. However, there 
is a fundamental distinction. In curve estimation, the domain of m is a con- 
tinuum, usually a closed subset of the real line. In the one-way layout, the 
domain of the function m is a discrete set of factor levels. Even in ordinal 
one-way layouts, no credible extension of m to a larger domain may exist. 
Tukey [(1977), Chapter 7] fitted several examples of ordinal one-way layouts 
that are not curve estimation problems because of intrinsic limitations on 
the domain of the function m. 

Hereafter, unless otherwise stated, we consider only ordinal one-way lay- 
outs. The following examples will serve as test cases for our methods: 

Example 1. The top subplot in Figure 1 displays monthly Australian 
red wine sales (in kiloliters) from January 1980 to October 1991. The data 
was reported by Brockwell and Davis (1996) and was analyzed there with 
techniques based on ARMA models. ARMA models are only one class of 
hypothetical probability models that might be entertained as a way of mim- 
icking the wine sales data. Because the data is not actually random, it is 
prudent to carry out alternative analyses. As Tukey (1980) pointed out, "In 
practice, methodologies have no assumptions and deliver no certainties." We 
will analyze the wine-sales data with mean estimators derived for the ordinal 
one-way layout model. Motivating this approach is the traditional decomposition
of an econometric time series into a deterministic term (trend plus
seasonal variation), plus a random noise term. The factor levels are the 
142 successive months in the period considered and are clearly ordinal. Ipso 
facto, mean monthly wine-sales are defined only on the discrete time grid 
of months. Our analysis in Section 2.5 finds a highly intelligible seasonal 
pattern in the wine sales. 

Fig. 1. Competing D4-basis fits to the Australian monthly red wine-sales data.

Fig. 2. Diagnostics for D4-basis fits to the Australian monthly red wine-sales data: residuals
for the HS(D4) fit, the empirical basis economy plot and the shrinkage vectors used
by competing D4-basis estimators.

Example 2. The artificial ordinal one-way layouts analyzed in Figures
3 and 4 are designed to bracket the situation found in the case study of Ex- 
ample 1. In each of Figures 3 and 4, the data in the top subplot is obtained 
by adding pseudo-random errors to the means displayed in the second sub- 
plot. The means in Figure 3 vary slowly while those in Figure 4 vary rapidly. 
To the human eye, the pattern of variation in the means is not visible in 
the data. In Section 2.6, comparing competing estimators of means on these 
two artificial ordinal one-way layouts adds to our understanding of their 
performance. 

Form the $n \times 1$ observation vector $y = \{\{y_{ij} : 1 \le j \le n_i\}, 1 \le i \le p\}$, where
n is the total number of observations. Let X be the $n \times p$ incidence matrix
that links observations to the relevant factor level. The ith column of X
contains $n_i$ ones, the other elements being zeroes. Let $\mu = (\mu_1, \mu_2, \ldots, \mu_p)'$,
where $\mu_i$ satisfies (1.2) with m unrestricted. The saturated model (1.1) is
equivalent to the assertion

(1.3) $y \sim N(\eta, \sigma^2 I_n)$ where $\eta = X\mu$.

The primary task in this paper is to devise regularized estimators of $\eta$, or,
equivalently, of $\mu = (X'X)^{-1}X'\eta$, that (asymptotically in p) dominate the
least squares estimator $\hat\eta_{LS} = X(X'X)^{-1}X'y$ under the saturated ordinal
model. We note that the desirability of analyzing the risk of estimators of $\eta$
under the saturated model is a basic way in which estimation in the one-way
layout differs from curve estimation.
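
As an illustration of the notation, the following minimal Python sketch builds the incidence matrix X for a small unbalanced layout and computes the least squares fit. The function names `incidence_matrix` and `least_squares_fit` and the toy data are assumptions introduced here, not part of the paper. Because X'X is diagonal, the least squares estimate of the means reduces to the level averages.

```python
def incidence_matrix(counts):
    """Build the n x p incidence matrix X of a one-way layout.

    counts[i] is n_i, the number of observations at factor level i.
    Each row has a single 1 in the column of the level it belongs to.
    """
    p = len(counts)
    rows = []
    for level, n_i in enumerate(counts):
        for _ in range(n_i):
            rows.append([1.0 if col == level else 0.0 for col in range(p)])
    return rows


def least_squares_fit(X, y):
    """Return (mu_hat, eta_hat) with mu_hat = (X'X)^{-1} X'y and eta_hat = X mu_hat.

    X'X is diagonal with entries n_i, so mu_hat is the vector of level averages.
    """
    p = len(X[0])
    counts = [sum(row[j] for row in X) for j in range(p)]
    sums = [sum(row[j] * y_r for row, y_r in zip(X, y)) for j in range(p)]
    mu_hat = [s / c for s, c in zip(sums, counts)]
    eta_hat = [sum(row[j] * mu_hat[j] for j in range(p)) for row in X]
    return mu_hat, eta_hat


# Toy layout (hypothetical data): three levels with 2, 1 and 3 observations.
X = incidence_matrix([2, 1, 3])
y = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
mu_hat, eta_hat = least_squares_fit(X, y)
```

In the balanced case with one observation per level (n = p), X is the identity and the least squares fit simply reproduces the data.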

Suppose that we assess any estimator $\hat\eta$ through its normalized quadratic
loss and corresponding risk

(1.4) $L(\hat\eta, \eta) = p^{-1}|\hat\eta - \eta|^2, \quad R(\hat\eta, \eta, \sigma^2) = E\,L(\hat\eta, \eta),$

the expectation being calculated under the saturated model. Equivalently,
we could discuss estimation of $\mu$ under the loss function $p^{-1}(\hat\mu - \mu)'X'X(\hat\mu - \mu)$.
The risk of $\hat\eta_{LS}$ is evidently $\sigma^2$. It is well known that this value is the
smallest risk attainable by unbiased estimators of $\eta$ in the saturated model
whether the factor is nominal or ordinal. Nevertheless, for both types of
factor, $\hat\eta_{LS}$ is an inadmissible estimator of $\eta$ whenever the number p of
factor levels exceeds two [Stein (1956)].

The James-Stein (1961) shrinkage estimator of $\eta$ improves significantly
on the quadratic risk of $\hat\eta_{LS}$ and is a good answer when the factor is nominal.
For an ordinal factor, estimators for $\eta$ that have still lower risk in the one-way
layout are often possible. The better estimators of $\eta$ developed in this paper
rely on a regularization strategy that enables the data to influence estimator 
construction. Our hybrid shrinkage estimators exploit the possibility of slow 
variation in the dependence of the means on the ordered factor levels, but 
do not assume it, and respond well to faster variation if present. 

Fig. 3. Competing D4-basis fits to the Smooth artificial data and the empirical basis
economy plot.

Fig. 4. Competing D4-basis fits to the Very Wiggly artificial data and the empirical basis
economy plot. Interpolating lines are added to guide the eye through the sequence of means
or estimated means. They have no further significance.

The broad approach is the following: (a) use prior conjecture about the unknown
means in the Gaussian saturated one-way layout to motivate classes
of candidate estimators for these means; (b) estimate the risk of each candi- 
date estimator under the saturated model; (c) define an adaptive estimator 
to be the candidate procedure with smallest estimated risk; (d) experiment 
with the adaptive estimator on both observed and artificial data; (e) study 
the asymptotic risk of such adaptive estimators under the saturated model. 

The inadmissibility of least squares fits to the means of a Gaussian satu- 
rated one-way layout has inspired considerable work on competing estima- 
tors. Candidate model selection, ridge regression or penalized least squares 
(PLS) estimators are all particular symmetric linear estimators. Important 
studies of symmetric linear estimators include Stein (1981), Li and Hwang 
(1984), Buja, Hastie and Tibshirani (1989) and Kneip (1994). Tukey (1977) 
proposed and experimented with certain smoothing algorithms for fitting ordinal
one-way layouts. Beran and Dümbgen (1998) used a finite-dimensional
version of Pinsker's (1980) asymptotic minimax bound to assess adaptive 
symmetric linear estimators that perform monotone shrinkage relative to a 
fixed orthonormal basis. 

Adaptive hybrid shrinkage (HS) estimators for the vector $\eta$, the main
contribution of this paper, combine monotone shrinkage (MS), a generalization
of PLS, with the soft-thresholding (ST) idea in Donoho and Johnstone
(1995). The adaptive HS estimators are devised to dominate asymptotically
both adaptive MS and adaptive ST estimators of $\eta$. Theorem 4.1 gives the
supporting risk analysis under the saturated model as the number p of factor 
levels tends to infinity. Interpretation of asymptotic minimax Theorem 3.1 
isolates basis economy as a key factor in superior performance of MS estima- 
tors and approximate basis economy as a key factor in superior performance 
of HS estimators. Applied to the penalty bases used in this paper, this in- 
terpretation suggests that HS estimators behave like MS estimators when 
the means of an ordinal one-way layout vary slowly and share the superior 
ability of ST estimators to track means that vary more rapidly. Related to 
HS estimators in strategy but not in tactics are the hybrid wavelet fits of 
Efromovich (1999). These combine a certain linear shrinkage strategy with 
hard-thresholding of wavelet coefficients. 

Sections 2.5 and 2.6 continue the analysis of Examples 1 and 2. Compar- 
isons through estimated risks are supplemented by basis economy plots and 
shrinkage vector plots that reveal working details of the competing estima- 
tors. The diagnostic plots in these examples support the claim made above 
that basis economy is important for superior performance of MS estimators 
and that approximate basis economy is important for superior performance 
of HS estimators. In particular, the numerical experiments confirm the supe- 
rior ability of adaptive HS estimators constructed on dth difference penalty 
bases to recover both low and high frequency features in the means of an
ordinal one-way layout. 

Curve estimation can be split conceptually into two problems: (a) estima- 
tion of means on the ordinal one-way layout of observed factor levels; and 
(b) estimation of the mean function between adjacent factor levels through 
some form of interpolation. The choice of function class in curve estimation 
strongly affects the implicit interpolation scheme. For nonparametric curve 
estimation, adaptive curve estimators that achieve the Pinsker asymptotic 
minimax bound over specified function classes were developed by Efromovich 
and Pinsker (1984) and by Golubev (1987). On the other hand, data does 
not come with an attached probability model. A data analyst interested in 
curve estimation, but not certain of an appropriate function class, might 
reasonably use the techniques of this paper to estimate the means at the 
observed factor levels; and might then experiment with curve estimates ob- 
tained from these by various interpolation schemes. 

This paper distinguishes strictly among data, statistical procedure, proba- 
bility model and pseudo-random numbers. Modern computing environments 
for applied and experimental statistics have returned the distinctions to 
prominence. An adaptive procedure implicitly fits the probability model 
that motivates it. However, using such a procedure on data differs from be- 
lieving that a probability model governs the data. Data is not certifiably 
random. Mathematical study of a statistical procedure under a probability 
model tests the procedure only on virtual data governed by that model. 
Such mathematical explorations become pertinent to statistical theory if 
the probability model can approximate salient relative frequencies in actual 
data of interest. Our understanding of statistical procedures is ultimately 
empirical, aided considerably by suitable diagnostic plots, knowledge of the 
substantive field, and intuitive interpretations of relevant mathematical re- 
sults [cf. Brillinger and Tukey (1985), Section 17, Beran (2001), Section 3, 
and Friedman (2001)]. In such respects, statistics does not differ from other 
sciences that address the world around us. 

2. HS estimators. This section begins by defining PLS estimators for 
the mean vector of the saturated ordinal one-way layout and then MS or 
ST estimators that use the same penalty basis. This background enables 
the definition of HS estimators that combine the MS and ST shrinkage 
strategies. Adaptive HS estimators are designed to perform well whether the 
components of the mean vector vary slowly or more rapidly. Our treatment 
covers both balanced and unbalanced one-way layouts. Section 4 develops 
asymptotic theory under the saturated model that supports the adaptation 
methodology used. 

2.1. Canonical representation of PLS estimators. As described in the Introduction,
the saturated model for the ordinal one-way layout with p factor
levels asserts that the observation vector y has an $N(\eta, \sigma^2 I_n)$ distribution,
where $\eta = X\mu$. Here X is the incidence matrix that links observations to the
relevant factor levels and n is the total sample size. The task is to estimate
the mean vector $\eta$. Let D be any matrix with p columns, let $\nu$ be an element
of the extended nonnegative reals $[0, \infty]$, and let $|\cdot|$ denote quadratic norm.
The candidate PLS estimator of $\eta$ is

(2.1) $\hat\eta_{PLS}(D, \nu) = X\hat\mu_{PLS}(D, \nu),$

where

(2.2) $\hat\mu_{PLS}(D, \nu) = \operatorname{argmin}_{\mu}\,[\,|y - X\mu|^2 + \nu|D\mu|^2\,].$

It is understood that $\hat\mu_{PLS}(D, \infty) = \lim_{\nu \to \infty} \hat\mu_{PLS}(D, \nu)$. The foregoing
displays yield the explicit formula

(2.3) $\hat\eta_{PLS}(D, \nu) = X(X'X + \nu D'D)^{-1}X'y.$

Both D and $\nu$ are to be chosen so as to control the quadratic risk of the PLS
estimator under the saturated model.

In the leading case of a balanced one-way layout, the matrix X'X is
a multiple of the identity matrix. Consequently, $\hat\eta_{PLS}$ may be computed
equivalently by applying the PLS strategy to the averages $\{y_{i\cdot} : 1 \le i \le p\}$,
rather than to the original data. Thus, the case n = p implicitly includes the
general balanced one-way layout. Of course, estimating $\sigma^2$ is easier when n
exceeds p (see Section 2.2).

A revealing canonical representation of $\hat\eta_{PLS}(D, \nu)$ is obtained through
the following algebraic reduction. The replication matrix R = X'X is a
$p \times p$ diagonal matrix whose kth diagonal element is the number of observations
at factor level $s_k$. Let $\mathcal{M}$ denote the regression space of the
one-way layout, that is, the subspace spanned by the columns of the incidence
matrix X. The columns of the matrix $U_0 = XR^{-1/2}$ provide an orthonormal
basis for this regression space. Let $B = R^{-1/2}D'DR^{-1/2}$ have spectral representation
$B = \Gamma\Lambda\Gamma'$, where the eigenvector matrix satisfies $\Gamma'\Gamma = \Gamma\Gamma' = I_p$
and the diagonal matrix $\Lambda = \mathrm{diag}\{\lambda_i\}$ gives the ordered eigenvalues with
$0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_p$. This eigenvalue ordering, the reverse of the customary,
is used here because the eigenvectors associated with the smallest eigenvalues
largely determine the value and performance of candidate estimator
$\hat\eta_{PLS}(D, \nu)$. Let $U = U_0\Gamma$. It follows from (2.3) that

(2.4) $\hat\eta_{PLS}(D, \nu) = U(I_p + \nu\Lambda)^{-1}U'y.$

The columns of the matrix U define the orthonormal penalty basis for the
regression space $\mathcal{M}$ of the one-way layout.
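
The step from (2.3) to (2.4) can be spelled out as follows. This is a sketch that assumes every factor level carries at least one observation, so that R is invertible, and that writes $\Gamma$ and $\Lambda$ for the eigenvector and eigenvalue matrices of B.

```latex
\begin{align*}
X'X + \nu D'D &= R + \nu D'D = R^{1/2}\,(I_p + \nu B)\,R^{1/2}, \\
X\,(X'X + \nu D'D)^{-1}X'
  &= X R^{-1/2}\,(I_p + \nu B)^{-1}\,R^{-1/2} X'
   = U_0\,(I_p + \nu\,\Gamma\Lambda\Gamma')^{-1}\,U_0' \\
  &= U_0\Gamma\,(I_p + \nu\Lambda)^{-1}\,\Gamma' U_0'
   = U\,(I_p + \nu\Lambda)^{-1}\,U'.
\end{align*}
```

The same factorization gives $U_0'U_0 = R^{-1/2}X'XR^{-1/2} = I_p$, which is why the columns of $U_0$, and hence of U, are orthonormal.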

Let $z = U'y$ and let $f(\nu)$ denote the column vector $(1/(1 + \nu\lambda_1), 1/(1 + \nu\lambda_2), \ldots, 1/(1 + \nu\lambda_p))'$,
with the understanding that $f(\infty) = \lim_{\nu \to \infty} f(\nu)$.
The distribution of z is then $N_p(\xi, \sigma^2 I_p)$, where $\xi = U'\eta$. The candidate PLS
estimator of $\xi$ implied by expression (2.4) is

(2.5) $\hat\xi_{PLS}(D, \nu) = U'\hat\eta_{PLS}(D, \nu) = f(\nu)z,$

where the multiplication of vectors in the expression to the right is performed
componentwise as in the S language. Equivalently,

(2.6) $\hat\eta_{PLS}(D, \nu) = U\hat\xi_{PLS}(D, \nu) = U\,\mathrm{diag}\{f(\nu)\}\,U'y.$
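
In canonical coordinates the PLS estimator is a componentwise multiplication of z by the shrinkage vector with entries 1/(1 + ν λ_i). The Python sketch below illustrates this; the function names and the numerical eigenvalues and coefficients are hypothetical, chosen only to show the monotone decay of the shrinkage vector.

```python
import math


def pls_shrinkage_vector(eigenvalues, nu):
    """Return the vector with entries 1 / (1 + nu * lambda_i).

    For nu = infinity the limit is used: components with lambda_i = 0 are
    kept unchanged and all other components are shrunk to zero.
    """
    if math.isinf(nu):
        return [1.0 if lam == 0.0 else 0.0 for lam in eigenvalues]
    return [1.0 / (1.0 + nu * lam) for lam in eigenvalues]


def shrink(f, z):
    """Componentwise product f * z, as in the S language."""
    return [fi * zi for fi, zi in zip(f, z)]


# Hypothetical ordered eigenvalues and canonical coefficients z = U'y.
eigenvalues = [0.0, 0.5, 2.0, 8.0]
z = [4.0, -2.0, 1.0, 0.5]
f = pls_shrinkage_vector(eigenvalues, 1.0)
xi_hat = shrink(f, z)
```

Because the eigenvalues are ordered from smallest to largest, the resulting shrinkage vector is automatically non-increasing.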

Remark. The successive columns $\{u_j : 1 \le j \le p\}$ of the penalty basis
matrix $U = U_0\Gamma$, where $U_0 = XR^{-1/2}$, have a variational characterization:

• Let $\gamma_j$ denote the jth column of the eigenvector matrix $\Gamma$.

• Find a unit vector $u_1$ in $\mathcal{M}$ that minimizes the penalty $|D(X'X)^{-1}X'u_1|^2$.
The answer is $u_1 = U_0\gamma$, where $\gamma$ is a $p \times 1$ unit vector that minimizes
$|DR^{-1/2}\gamma|^2 = \gamma'B\gamma$. Thus, $u_1 = U_0\gamma_1$.

• Find a unit vector $u_2$ in $\mathcal{M}$ that minimizes the penalty $|D(X'X)^{-1}X'u_2|^2$
subject to the constraint that $u_2$ is orthogonal to $u_1$. The answer is
$u_2 = U_0\gamma$, where $\gamma$ is a $p \times 1$ unit vector orthogonal to $\gamma_1$ that minimizes
$|DR^{-1/2}\gamma|^2 = \gamma'B\gamma$. Thus, $u_2 = U_0\gamma_2$.

• Continue sequential constrained minimization to obtain the penalty basis
matrix

(2.7) $U = (U_0\gamma_1, U_0\gamma_2, \ldots, U_0\gamma_p) = U_0\Gamma.$

2.2. Adaptive MS estimators. The canonical representation (2.6) of PLS 
estimators suggests a larger class of candidate shrinkage estimators that use 
the same penalty basis U. Let 

(2.8) $\mathcal{F}_{MS} = \mathcal{F}_{MS}(p) = \{f \in [0, 1]^p : f_1 \ge f_2 \ge \cdots \ge f_p\}$

and let

(2.9) $\hat\xi_{MS}(D, f) = fz, \quad f \in \mathcal{F}_{MS}.$

The candidate MS estimators for $\eta$ associated with penalty matrix D are
defined by

(2.10) $\hat\eta_{MS}(D, f) = U\hat\xi_{MS}(D, f) = U\,\mathrm{diag}\{f\}\,U'y, \quad f \in \mathcal{F}_{MS}.$

It follows from (2.6) that the candidate PLS estimators are a proper subset
of the MS family in which the shrinkage vector f is restricted to the form
$\{f(\nu) : \nu \in [0, \infty]\}$.

For any vector x, let ave(x) denote the average of its components. Define
the function

(2.11) $r_{MS}(f, \xi, \sigma^2) = \mathrm{ave}[f^2\sigma^2 + (1 - f)^2\xi^2], \quad f \in [0, 1]^p.$

Because $|\hat\eta_{MS}(D, f) - \eta|^2 = |fz - \xi|^2$, it follows that the normalized quadratic
risk of the candidate MS estimator is

(2.12) $R(\hat\eta_{MS}(D, f), \eta, \sigma^2) = r_{MS}(f, \xi, \sigma^2), \quad f \in \mathcal{F}_{MS}.$

In particular, the risk of the candidate PLS estimator is just $r_{MS}(f(\nu), \xi, \sigma^2)$.

The risk function $r_{MS}(f, \xi, \sigma^2)$ depends on the unknown parameters $\xi^2$ and $\sigma^2$.
Having obtained a variance estimator $\hat\sigma^2$, we may estimate $\xi^2$ by $z^2 - \hat\sigma^2$ and,
hence, the risk function by

(2.13) $\hat r_{MS}(D, f) = \mathrm{ave}[f^2\hat\sigma^2 + (1 - f)^2(z^2 - \hat\sigma^2)] = \mathrm{ave}[(f - \hat g)^2 z^2] + \hat\sigma^2\,\mathrm{ave}(\hat g),$

where $f \in \mathcal{F}_{MS}$ and $\hat g = (z^2 - \hat\sigma^2)/z^2$. Expression (2.13) is Stein's (1981)
unbiased risk estimator combined with an estimator of $\sigma^2$. Alternatively,
the risk estimator $\hat r_{MS}(D, f)$ follows from the argument for Mallows' (1973)
$C_p$ criterion.
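
The estimated risk of a candidate shrinkage vector can be evaluated directly from z and a variance estimate. The Python sketch below computes it in two algebraically equivalent ways: the direct plug-in form, and the form that separates the part depending on f (a weighted squared distance to the vector with entries (z_i^2 - variance estimate)/z_i^2) from a constant. The function names and numbers are hypothetical, and the second form assumes every z_i is nonzero.

```python
def ave(x):
    """Average of the components of x."""
    return sum(x) / len(x)


def ms_risk_estimate(f, z, sigma2_hat):
    """Plug-in risk estimate: ave[f^2 s2 + (1 - f)^2 (z^2 - s2)]."""
    return ave([fi ** 2 * sigma2_hat + (1.0 - fi) ** 2 * (zi ** 2 - sigma2_hat)
                for fi, zi in zip(f, z)])


def ms_risk_estimate_alt(f, z, sigma2_hat):
    """Equivalent form: ave[(f - g)^2 z^2] + s2 * ave(g), g = (z^2 - s2) / z^2.

    Assumes every z_i is nonzero, so that g is well defined.
    """
    g = [(zi ** 2 - sigma2_hat) / zi ** 2 for zi in z]
    weighted = ave([(fi - gi) ** 2 * zi ** 2 for fi, gi, zi in zip(f, g, z)])
    return weighted + sigma2_hat * ave(g)


# Hypothetical canonical coefficients, shrinkage vector and variance estimate.
z = [3.0, -1.5, 0.5, 0.2]
f = [1.0, 0.8, 0.3, 0.0]
sigma2_hat = 0.25
risk_direct = ms_risk_estimate(f, z, sigma2_hat)
risk_alt = ms_risk_estimate_alt(f, z, sigma2_hat)
```

Only the weighted squared distance depends on f, which is why minimizing the estimated risk over monotone f is a weighted least squares problem.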

For fixed penalty matrix D, the shrinkage-adaptive MS(D) estimator is
defined to be $\hat\eta_{MS}(D, \hat f_{MS})$, where

(2.14) $\hat f_{MS} = \operatorname{argmin}_{f \in \mathcal{F}_{MS}} \hat r_{MS}(D, f) = \operatorname{argmin}_{f \in \mathcal{F}_{MS}} \mathrm{ave}[(f - \hat g)^2 z^2].$

To accomplish the minimization, let $\mathcal{K} = \{k \in R^p : k_1 \ge k_2 \ge \cdots \ge k_p\}$ and
let

(2.15) $\hat k = \operatorname{argmin}_{k \in \mathcal{K}} \mathrm{ave}[(k - \hat g)^2 z^2].$

Computation of $\hat k$ is a weighted isotonic least squares problem that can be
solved in a finite number of steps with the pool-adjacent-violators algorithm
[cf. Robertson, Wright and Dykstra (1988)]. Each component of $\hat f_{MS}$ is then
the positive part of the corresponding component of $\hat k$, as shown in Beran
and Dümbgen (1998).
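
A minimal Python sketch of this computation follows. It applies a textbook pool-adjacent-violators pass for a non-increasing fit with weights z_i^2 and then takes positive parts. The function name `adaptive_ms_shrinkage` and the example numbers are assumptions introduced here, and the code assumes every z_i is nonzero.

```python
def adaptive_ms_shrinkage(z, sigma2_hat):
    """Minimize ave[(f - g)^2 z^2] over non-increasing f in [0, 1]^p.

    g_i = (z_i^2 - sigma2_hat) / z_i^2 and the weights are z_i^2.
    Assumes every z_i is nonzero.
    """
    g = [(zi * zi - sigma2_hat) / (zi * zi) for zi in z]
    w = [zi * zi for zi in z]

    # Each block stores [weighted mean, total weight, number of components].
    blocks = []
    for gi, wi in zip(g, w):
        blocks.append([gi, wi, 1])
        # Pool adjacent blocks while the non-increasing order is violated.
        while len(blocks) > 1 and blocks[-1][0] > blocks[-2][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])

    k_hat = []
    for mean, _, count in blocks:
        k_hat.extend([mean] * count)

    # Positive part of the isotonic solution gives the shrinkage vector.
    return [max(k, 0.0) for k in k_hat]


# Hypothetical canonical coefficients and variance estimate.
z = [5.0, 1.0, 2.0, 0.5]
f_ms = adaptive_ms_shrinkage(z, 1.0)
```

No upper truncation is needed in this sketch because each g_i is at most 1 whenever the variance estimate is nonnegative, so every pooled mean is at most 1 as well.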

Remark. The shrinkage-adaptive PLS(D) estimator is obtained by restricting
the minimization in (2.14) to monotone shrinkage vectors of the
form $f = f(\nu)$. This weighted nonlinear least squares computation is harder
than constructing the more ambitious shrinkage-adaptive MS(D) estimator.

Useful in risk estimation is the high component variance estimator $\hat\sigma_H^2$,
which uses the strategy of pooling sums of squares from analysis of variance.
Choose $\bar U$ so that the concatenated matrix $(U \mid \bar U)$ is orthogonal. Set $\bar z = \bar U'y$
in analogy to the earlier $z = U'y$. Then

(2.16) $\hat\sigma_H^2 = (n - q)^{-1}\left[\sum_{i=q+1}^{p} z_i^2 + |\bar z|^2\right],$

where $q \le \min\{p, n - 1\}$. The bias of $\hat\sigma_H^2$ is $(n - q)^{-1}\sum_{i=q+1}^{p} \xi_i^2$.
When q = p < n, the estimator $\hat\sigma_H^2$ reduces to the least squares estimator
$\hat\sigma_{LS}^2 = (n - p)^{-1}|y - \hat\eta_{LS}|^2$, which is unbiased. When p = n, the estimator
$\hat\sigma_H^2$ is a pure pooling estimator whose bias is small if $(p - q)^{-1}\sum_{i=q+1}^{p} \xi_i^2$
is nearly zero. We will seek to arrange this through choice of the penalty
matrix D.
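
The following Python sketch illustrates a high component variance estimator of this pooling type. It rests on an assumption about the form of (2.16): the squared canonical coefficients with index above q are pooled with the squared residual coefficients (those orthogonal to the regression space) and divided by n - q. The function name, the argument `zbar` and the numbers are hypothetical.

```python
def high_component_variance(z, q, n, zbar=()):
    """Pooled variance estimate from high-order components.

    z    : canonical coefficients U'y, ordered by the penalty basis
    q    : number of leading components excluded from the pooling
    n    : total number of observations
    zbar : coefficients of y orthogonal to the regression space
           (empty when n equals p)
    """
    p = len(z)
    if not 0 <= q <= min(p, n - 1):
        raise ValueError("q must satisfy 0 <= q <= min(p, n - 1)")
    # z[q:] holds components q+1, ..., p in the paper's 1-based indexing.
    pooled = sum(zi * zi for zi in z[q:]) + sum(v * v for v in zbar)
    return pooled / (n - q)


# Hypothetical example with one observation per level (n = p = 6).
z = [10.0, -4.0, 0.5, -0.5, 0.3, -0.1]
sigma2_hat = high_component_variance(z, q=2, n=6)
```

In this toy example the two large leading coefficients are excluded, so the estimate is driven only by the small high-order components.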



2.3. Adaptive ST estimators. For $t \ge 0$ and $1 \le i \le p$, let $h_i(t, z) = [1 - t/|z_i|]_+$. Let

(2.17) $\mathcal{F}_{ST} = \mathcal{F}_{ST}(p) = \{f \in [0, 1]^p : f_i = h_i(t, z) \text{ for } t \ge 0 \text{ and } 1 \le i \le p\}.$

Unlike the monotone class $\mathcal{F}_{MS}$ defined in (2.8), the class $\mathcal{F}_{ST}$ of shrinkage
vectors is data dependent. Let

(2.18) $\hat\xi_{ST}(D, f) = fz, \quad f \in \mathcal{F}_{ST},$

multiplication being performed componentwise as in S. The algebraic identity
$h_i(t, z)z_i = \mathrm{sgn}(z_i)[|z_i| - t]_+$ connects $\hat\xi_{ST}(D, f)$ with the definition of
soft-thresholding in Donoho and Johnstone (1995). The candidate ST estimators
for $\eta$ associated with penalty matrix D are

(2.19) $\hat\eta_{ST}(D, f) = U\hat\xi_{ST}(D, f) = U\,\mathrm{diag}\{f\}\,U'y, \quad f \in \mathcal{F}_{ST}.$

Let $\hat G$ denote the empirical cumulative distribution function of the $\{|z_i| : 1 \le i \le p\}$,
let $G = E(\hat G)$ and define

(2.20) $r_{ST}(f, \xi, \sigma^2) = \sigma^2[1 - 2G(t)] + \int_0^\infty (u^2 \wedge t^2)\,dG(u), \quad f \in \mathcal{F}_{ST},$

where $\wedge$ denotes the minimum operator. It follows from Stein (1981) that
the normalized quadratic risk of the candidate ST estimator is

(2.21) $R(\hat\eta_{ST}(D, f), \eta, \sigma^2) = r_{ST}(f, \xi, \sigma^2), \quad f \in \mathcal{F}_{ST}.$

Having devised a variance estimator $\hat\sigma^2$, we may estimate this risk by

(2.22) $\hat r_{ST}(D, f) = \hat\sigma^2[1 - 2\hat G(t)] + \int_0^\infty (u \wedge t)^2\,d\hat G(u), \quad f \in \mathcal{F}_{ST}.$

Let $t_p = (2\log(p))^{1/2}$. For fixed penalty matrix D, the shrinkage-adaptive
ST(D) estimator is defined to be $\hat\eta_{ST}(D, \hat f_{ST})$, where

(2.23) $\hat f_{ST} = h(\hat t, z)$ where $\hat t = \operatorname{argmin}_{t \in [0, t_p]} \hat r_{ST}(D, t),$

as in Donoho and Johnstone (1995). Because $\hat t$ must be one of the values
$\{|z_i| : 1 \le i \le p\}$, it can be computed readily.
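
The threshold search can be sketched in a few lines of Python. The function names and data below are hypothetical. As assumptions of this sketch, the candidate thresholds are the observed absolute coefficients not exceeding the upper limit, together with 0 (which corresponds to no shrinkage), and the upper limit defaults to the square root of 2 log p as stated above.

```python
import math


def st_risk_estimate(t, z, sigma2_hat):
    """Estimated risk of soft-thresholding at t:
    s2 * (1 - 2 * Ghat(t)) + ave(min(|z_i|, t)^2),
    where Ghat is the empirical cdf of the |z_i|."""
    p = len(z)
    abs_z = [abs(zi) for zi in z]
    g_hat = sum(1 for a in abs_z if a <= t) / p
    return sigma2_hat * (1.0 - 2.0 * g_hat) + sum(min(a, t) ** 2 for a in abs_z) / p


def adaptive_st_shrinkage(z, sigma2_hat, t_max=None):
    """Return (t_hat, f) minimizing the estimated risk over candidate thresholds."""
    p = len(z)
    if t_max is None:
        t_max = math.sqrt(2.0 * math.log(p))
    # Candidates: 0 plus every observed |z_i| within the allowed range.
    candidates = [0.0] + sorted(abs(zi) for zi in z if abs(zi) <= t_max)
    t_hat = min(candidates, key=lambda t: st_risk_estimate(t, z, sigma2_hat))
    # f_i = [1 - t/|z_i|]_+ ; a zero coefficient is assigned 0 since f_i * z_i = 0 anyway.
    f = [max(1.0 - t_hat / abs(zi), 0.0) if zi != 0.0 else 0.0 for zi in z]
    return t_hat, f


# Hypothetical coefficients: two large components among several small ones.
z = [6.0, -0.3, 0.2, -5.0, 0.1, 0.4]
t_hat, f_st = adaptive_st_shrinkage(z, sigma2_hat=1.0)
```

Within each interval between consecutive ordered absolute coefficients the estimated risk is non-decreasing in t, which is why a search over finitely many candidates suffices.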

2.4. Adaptive HS estimators. Let $p_1 = \lfloor \alpha p \rfloor$, where $\lfloor \cdot \rfloor$ denotes integer
part and the split fraction $\alpha \in [0, 1]$. For any vector $k \in R^p$, define the subvectors
$k_{(1)} = \{k_i : 1 \le i \le p_1\}$ and $k_{(2)} = \{k_i : p_1 + 1 \le i \le p\}$ of respective
dimensions $p_1$ and $p_2 = p - p_1$. Candidate HS estimators apply separate
shrinkage strategies to the subvectors $z_{(1)}$ and $z_{(2)}$ of z. We focus on the
MS × ST hybrid because it proves particularly effective in the examples to
be considered. The definitions of the MS × MS, ST × ST and ST × MS
hybrids are analogous.

Efromovich (1999) considered HS of wavelet coefficients in which MS is re- 
placed by a certain linear shrinkage methodology and ST is replaced by hard- 
thresholding. In both that paper and here, the aim is to compromise ben- 
eficially between a shrinkage approach that assumes regression coefficients 
are ordered in importance and a shrinkage approach that relies on sparsity 
of important regression coefficients. Considerable technical differences exist. 
We apply adaptive MS rather than Efromovich-Pinsker shrinkage to the 
low-frequency regression coefficients. On the remaining coefficients, we use
ST rather than hard-thresholding and select the soft-threshold to minimize 
estimated risk. The regularity conditions for Stein's (1981) risk estimator 
are satisfied by soft-thresholding but not by hard-thresholding. 

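Let

(2.24) $\mathcal{F}_{HS} = \{f : f_{(1)} \in \mathcal{F}_{MS}(p_1),\; f_{(2)} \in \mathcal{F}_{ST}(p_2)\}$

and let

(2.25) $\hat\xi_{HS}(D, \alpha, f) = fz, \quad f \in \mathcal{F}_{HS}.$

The candidate MS × ST HS estimators for $\eta$ associated with penalty matrix
D are defined by

(2.26) $\hat\eta_{HS}(D, \alpha, f) = U\hat\xi_{HS}(D, \alpha, f) = U\,\mathrm{diag}\{f\}\,U'y, \quad f \in \mathcal{F}_{HS}.$

From the preceding sections, it follows that the normalized quadratic risk
of this candidate HS estimator is

(2.27) $R(\hat\eta_{HS}(D, \alpha, f), \eta, \sigma^2) = p^{-1}[p_1 r_{MS}(f_{(1)}, \xi_{(1)}, \sigma^2) + p_2 r_{ST}(f_{(2)}, \xi_{(2)}, \sigma^2)], \quad f \in \mathcal{F}_{HS}.$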

Table 1

                  LS       PLS(D4)   MS(D4)    ST(D4)    HS(D4)
Estimated risk    0.0115   0.0093    0.0071    0.0047    0.0039



Write $\hat r_{MS}(D, f_{(1)})$ for the risk estimator (2.13) computed on the subvector
$z_{(1)}$. Similarly, write $\hat r_{ST}(D, f_{(2)})$ for the risk estimator (2.22) computed
on the subvector $z_{(2)}$. The risk of the candidate HS estimator is then estimated
by

(2.28) $\hat r_{HS}(D, \alpha, f) = p^{-1}[p_1 \hat r_{MS}(D, f_{(1)}) + p_2 \hat r_{ST}(D, f_{(2)})], \quad f \in \mathcal{F}_{HS}.$

For fixed penalty matrix D and split fraction $\alpha$, the shrinkage-adaptive
HS(D) estimator is defined to be $\hat\eta_{HS}(D, \alpha, \hat f_{HS})$, where

(2.29) $\hat f_{HS} = \operatorname{argmin}_{f \in \mathcal{F}_{HS}} \hat r_{HS}(D, \alpha, f).$

The minimization is accomplished by minimizing separately each of the two
summands on the right-hand side of (2.28) in the manner discussed previously.
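
The separable structure of the hybrid can be sketched in Python as follows. The function `hybrid_shrinkage` takes two callables, each returning a shrinkage vector and an estimated risk for its subvector, and combines them with weights p1/p and p2/p. The callables `keep_all` and `kill_all` below are deliberately trivial placeholders standing in for the adaptive MS and ST minimizers; they, the function names and the numbers are assumptions of this sketch.

```python
import math


def hybrid_shrinkage(z, alpha, shrink_low, shrink_high):
    """Apply one strategy to the first floor(alpha * p) components of z and
    another to the remaining components.

    shrink_low and shrink_high map a subvector to
    (shrinkage vector, estimated risk on that subvector).
    Returns the concatenated shrinkage vector and the combined estimated risk.
    """
    p = len(z)
    p1 = math.floor(alpha * p)
    p2 = p - p1
    f1, risk1 = shrink_low(z[:p1]) if p1 > 0 else ([], 0.0)
    f2, risk2 = shrink_high(z[p1:]) if p2 > 0 else ([], 0.0)
    return f1 + f2, (p1 * risk1 + p2 * risk2) / p


SIGMA2_HAT = 0.04  # hypothetical variance estimate


def keep_all(sub):
    """Placeholder: no shrinkage (f = 1), whose estimated risk is the variance estimate."""
    return [1.0] * len(sub), SIGMA2_HAT


def kill_all(sub):
    """Placeholder: full shrinkage (f = 0), with estimated risk ave(z^2 - s2)."""
    return [0.0] * len(sub), sum(v * v - SIGMA2_HAT for v in sub) / len(sub)


# Hypothetical coefficients: two large low-order components, six small ones.
z = [4.0, -3.0, 0.5, 0.2, -0.1, 0.3, 0.1, -0.2]
f_hs, risk_hs = hybrid_shrinkage(z, 0.25, keep_all, kill_all)
```

Because the combined estimated risk is a weighted sum of two terms that involve disjoint parts of f, each callable can minimize its own term independently. When the product of the split fraction and p is not exactly representable in floating point, the integer part should be checked by hand.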

2.5. A case study. Figure 1 presents competing fits to monthly Aus- 
tralian red wine sales (in kiloliters) from January 1980 to October 1991. 
The data are taken from Brockwell and Davis (1996) and the ordinal factor
is month. Here n = p = 142. The penalty matrix is the fourth difference
operator D4, which is defined explicitly in Section 3.2. The high component
variance estimate $\hat\sigma_H^2$ is determined by (2.16) with $q = \lfloor 0.85p \rfloor$. The
partition in the definition of HS(D4) uses $\alpha = 0.3$. Adaptation to minimize
estimated risk selected the values of $\alpha$ and of the penalty matrix from a class
of possibilities described in Section 3.3. The estimated risks of the competing 
estimators are shown in Table 1. 

The LS fit (not shown) coincides with the raw data. On the basis of es- 
timated risk, PLS(D4) is only a modest improvement over LS, MS(D4) is 
preferable, while ST(D4) and HS(D4) are substantially preferable, the hy- 
brid estimator being best. Theorem 4.1 shows that, under model (1.1), the 
estimated risks of these adaptive estimators approximate their risks under 
the saturated model as p tends to infinity. 

On looking closely at Figure 1, we discern a regular seasonal pattern in 
the HS(D4) and ST(D4) fits. Each year, estimated mean monthly red wine 
sales rise steadily from an annual low in January to a peak around July or 
August (winter in Australia) and then drop into a trough with a secondary 
peak around November or December (in time for the Christmas holiday sea- 
son). The adaptive fits with smallest estimated risk have recovered a highly 
intelligible seasonal pattern in sales that may be linked to seasonal patterns
in market demand and in winery operations after harvest and fermentation. 

Figure 2 examines what is going on behind the fits. The residuals from 
the HS(D4) fit are plausibly homoscedastic. A Q-Q plot (not shown) in- 
dicates that their marginal distribution is roughly normal, apart from out- 
liers. This illustrates the tendency of our procedures to fit the data in terms 
of the motivating model. Subplot (1,2) plots the transformed components
$\{|z_i|^{1/2}\,\mathrm{sgn}(z_i) : 1 \le i \le p\}$ of the coefficients $z = U'y$. The square root
transformation reduces the vertical range of the plot and makes more visible the
behavior of small components of z. Evidently, the first four columns of U are
crucial in representing y and so $\eta$. Blips in this plot at certain higher-order
components suggest that the corresponding basis vectors may also be important
in estimating $\eta$. We call subplot (1,2) an empirical basis economy plot.
The concept of basis economy is treated formally in Section 3.1. As well,
this subplot suggests the choice of q that enters into the high-component
variance estimator $\hat\sigma_H^2$.

The four shrinkage vector subplots in Figure 2 display the shrinkage vec- 
tors that define the competing adaptive fits. Because the shrinkage vectors 
of the PLS(D4) and MS(D4) estimates are necessarily monotone, both give 
considerable weight to many components of z so as not to disregard the 
small blips discussed above. The ST(D4) and HS(D4) estimates are better 
able to select the more important components of z, thereby reducing esti- 
mated risk through tradeoff of estimated variance against bias. Note that 
the HS(D4) estimate disregards more of the higher-order components of z 
than does ST(D4). 

2.6. Experiments with artificial data. Figures 3 and 4 exhibit the com- 
peting adaptive estimators on two sets of artificial monthly data that bracket 
the situation found in Example 1. In this experiment, p = n = 200, the factor
levels are $\{s_i = i : 1 \le i \le 200\}$, and the means at which we have one noisy
observation are

Smooth: $m_1(s_i) = 2 - 50((s_i/200 - 0.25)(s_i/200 - 0.75))^2$,
Very Wiggly: $m_2(s_i) = m_1(s_i/200) - 0.25\sin(100\pi(s_i/200))$.

The observations are given by $y_i = m(s_i) + e_i$, where the $\{e_i\}$ form a single
pseudo-random sample drawn from the $N(0, \sigma^2/200)$ distribution with $\sigma = 0.5$.
In the data analysis, the variance $\sigma^2$ is estimated by the high component
estimator $\hat\sigma_H^2$ defined in (2.16), with q = 0.75p.
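
For readers who wish to reproduce the two mean vectors, a minimal Python sketch follows. It makes two assumptions about the displayed formulas: the outer exponent in the Smooth means is a square, and the Very Wiggly means add the sinusoidal term to the Smooth mean at the same factor level. The function names are hypothetical, and the noise is deliberately not simulated here.

```python
import math


def smooth_mean(s, p=200):
    """Smooth means: 2 - 50 * ((s/p - 0.25) * (s/p - 0.75))^2 (exponent assumed)."""
    x = s / p
    return 2.0 - 50.0 * ((x - 0.25) * (x - 0.75)) ** 2


def very_wiggly_mean(s, p=200):
    """Very Wiggly means: Smooth mean minus 0.25 * sin(100 * pi * s / p)."""
    return smooth_mean(s, p) - 0.25 * math.sin(100.0 * math.pi * (s / p))


# Factor levels s_i = i for i = 1, ..., 200.
levels = list(range(1, 201))
m1 = [smooth_mean(s) for s in levels]
m2 = [very_wiggly_mean(s) for s in levels]
```

With these levels the sinusoidal term has period 4 in the level index, so the Very Wiggly means oscillate around the Smooth means at a high frequency relative to the grid.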

Fitting this artificial data is a one-way layout problem rather than a curve 
estimation problem because the measurements are deemed to be monthly as 
in Example 1. The means in the Smooth case vary more slowly than those 
estimated in Example 1, while the means in the Very Wiggly case vary more 
rapidly. The goal is to learn how the competing adaptive estimators perform 
in both scenarios. The first rows of Figures 3 and 4 give the scatterplots of
the Smooth and Very Wiggly data, respectively. To the human eye, these 
scatterplots are scarcely distinguishable. Good estimators of the unknown 
mean vectors seek to do better than the eye. 

The penalty matrix used for both sets of artificial data is the fourth 
difference operator D4. The partition in the definition of HS(D4) uses $\alpha = 0.05$.
Adaptation to minimize estimated risk selected these values of $\alpha$ and
of the penalty matrix from a class of possibilities described in Section 3.3. 
According to the asymptotics in Section 4, the risk, loss and estimated risk 
all converge to a common limit. In the present experiment with artificial 
data, the losses are readily computed. For the Smooth data, the estimated 
risks and actual losses of the competing estimators are shown in Table 2. 

We note that the estimated risks for the shrinkage adaptive estimators are 
negative. The actual losses are small and convergence to asymptotic limits 
has not happened. Nevertheless, the estimated risks reflect the ordering of 
the true losses. In Figure 3 the visual quality of the competing fits follows the 
same ordering. The interpolated ST(D4) estimate is unsatisfactorily jagged, 
though certainly better than the LS estimate. The MS(D4) and HS(D4) 
estimates are close to the truth, though the latter exhibits a small ripple 
not present in the actual mean vector. The basis economy plot in the last 
subplot of Figure 3 suggests that the D4 penalty basis is economical in 
this example. This is verified by examining the corresponding plot (not
shown) computed from the true mean function.
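
The negative estimated risks reported for the Smooth data are not a
computational error. The following minimal sketch assumes the standard unbiased
risk estimate for a componentwise linear shrinkage estimator $f z$ of $\xi$,
namely $\mathrm{ave}[f^2\sigma^2 + (1-f)^2(z^2 - \sigma^2)]$; the paper's own
risk estimators are defined in Section 2 and are not reproduced here. The term
$z_i^2 - \sigma^2$ estimates $\xi_i^2$ without bias but can be negative, so the
estimated risk can fall below zero when the true risk is small. The function
names and the toy data are illustrative only.

```python
import numpy as np

def estimated_risk_linear(z, f, sigma2):
    """Unbiased risk estimate ave[f^2 sigma^2 + (1 - f)^2 (z^2 - sigma^2)]."""
    z = np.asarray(z, dtype=float)
    f = np.asarray(f, dtype=float)
    return float(np.mean(f**2 * sigma2 + (1.0 - f)**2 * (z**2 - sigma2)))

def true_risk_linear(xi, f, sigma2):
    """Normalized quadratic risk ave[f^2 sigma^2 + (1 - f)^2 xi^2]."""
    xi = np.asarray(xi, dtype=float)
    f = np.asarray(f, dtype=float)
    return float(np.mean(f**2 * sigma2 + (1.0 - f)**2 * xi**2))

# Toy canonical data: the true coefficients are zero and the observed z are small.
sigma2 = 1.0
xi = np.zeros(4)
z = np.array([0.1, -0.2, 0.05, 0.3])

f_shrink = np.zeros(4)   # shrink every component to zero
f_ls = np.ones(4)        # no shrinkage: the least squares estimator

est_shrink = estimated_risk_linear(z, f_shrink, sigma2)  # ave(z^2) - sigma^2 < 0
est_ls = estimated_risk_linear(z, f_ls, sigma2)          # equals sigma^2
risk_shrink = true_risk_linear(xi, f_shrink, sigma2)     # equals 0
```

In this toy case the true risk of full shrinkage is zero while its estimate is
negative; the ordering of the two estimated risks still agrees with the ordering
of the true risks.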

For the Very Wiggly data, the estimated risks and actual losses of the 
competing estimators are shown in Table 3. 

Table 2

                     LS       MS(D4)    ST(D4)    HS(D4)
Estimated risk     0.2846    -0.0434   -0.0296   -0.0449
Loss               0.2325     0.0072    0.0358    0.0077

Table 3

                     LS       MS(D4)    ST(D4)    HS(D4)
Estimated risk     0.2842    -0.0063   -0.0239   -0.0350
Loss               0.2325     0.0313    0.0447    0.0285

In Figure 4, interpolating lines have been added to guide the eye through
the sequence of means or estimated means. They have no further significance
because we are not doing curve estimation. The HS(D4) estimate is
best visually, as well as in loss. The HS(D4) and ST(D4) estimates both
indicate the amplitude of the high frequency component in the unknown
mean more successfully than the MS(D4) estimate. However, the actual loss
of the ST(D4) estimate exceeds that of the MS(D4) estimate. Both casual
scrutiny and the ordering of the estimated losses make ST(D4) look better
than it is. Evidently the asymptotics have not fully taken hold. The basis
economy plot in subplot (3, 2) of Figure 4 reveals the possible importance
of component $z_{102}$. In the Very Wiggly case, the D4 penalty basis is sparse
in the sense that most components of $\xi$ are small. However, it is not
economical because a high-order basis vector is needed to approximate the high
frequency sinusoidal component in the mean.

In this experiment, the HS(D4) estimate, unlike the others considered, 
performs well in both the Smooth case and the Very Wiggly case. This is 
empirical evidence in its favor. 

3. Penalty matrix and split fraction. For monotone shrinkage, the ideal
choice of basis $U$ would have its first column proportional to the unknown
mean vector $\eta$ so that only the first component of $\xi = U'\eta$ is nonzero. Then
the choice of shrinkage vector $f$ to minimize risk would have first component
equal to 1 and all other components equal to 0. Though unrealizable, this
ideal choice indicates that prior information or conjecture about $\eta$ should be
exploited in selecting $U$. We say informally that the columns of $U$ provide an
economical basis for the regression space if all but the first few components
of $\xi$ are very nearly zero. Construction of the basis $U$ via a penalty
matrix $D$, the method used in this paper, is a practical way of using vague
prior information or conjecture about the function $m$ to find a plausibly
economical basis for expressing the mean vector $\eta$.
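
The ideal (unrealizable) basis described above can be illustrated numerically.
The sketch below builds an orthonormal basis whose first column is proportional
to a known mean vector and confirms that only the first canonical coefficient is
nonzero. It assumes the normalized risk $\mathrm{ave}[f^2\sigma^2 + (1-f)^2\xi^2]$
for componentwise shrinkage; the helper name `basis_with_first_column` and the
example vector are hypothetical.

```python
import numpy as np

def basis_with_first_column(eta):
    """Orthonormal p x p matrix whose first column is proportional to eta.

    QR of [eta, I_p] yields an orthogonal Q whose first column is
    +/- eta / |eta| (eta is assumed to be nonzero).
    """
    eta = np.asarray(eta, dtype=float)
    p = eta.size
    Q, _ = np.linalg.qr(np.column_stack([eta, np.eye(p)]))
    return Q

eta = np.array([2.0, 1.0, 0.5, -1.0, 3.0])   # pretend the mean vector is known
p, sigma2 = eta.size, 1.0

U = basis_with_first_column(eta)
xi = U.T @ eta                                # canonical coefficients

# Ideal shrinkage vector: keep the first component, kill the rest.
f = np.zeros(p)
f[0] = 1.0
ideal_risk = np.mean(f**2 * sigma2 + (1.0 - f)**2 * xi**2)   # sigma2 / p
ls_risk = sigma2                                             # risk of LS (f = 1)
```

With this basis the ideal risk is $\sigma^2/p$ rather than $\sigma^2$, which is
the gain that an economical basis tries to approximate without knowing $\eta$.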

3.1. The role of basis economy. Mathematical analysis of an idealized
economy concept reveals the importance of basis economy in reducing risk
through monotone shrinkage. For every $b \in (0, 1]$, let
$\mathcal{S}_M(b) = \{a \in R^p : a_i = 1 \text{ if } 1 \le i \le bp,\ 1 \le a_{\lfloor bp \rfloor + 1} \le \cdots \le a_p \le \infty\}$.
For every $a \in \mathcal{S}_M(b)$, every $r > 0$ and every $\sigma^2 > 0$, define the ellipsoid

(3.1)   $E(r, a, \sigma^2) = \{\xi \in R^p : \mathrm{ave}(a\xi^2) \le \sigma^2 r\}.$

If $\xi \in E(r, a, \sigma^2)$ and $a_i = \infty$, it is to be understood that $\xi_i = 0$ and
$a_i \xi_i^2 = 0$. We consider bases $U$ such that, in the resulting canonical model,
$\xi \in E(r, a, \sigma^2)$ for some $r > 0$, some $a \in \mathcal{S}_M(b)$ and some $b \in (0, 1]$.

A finite-dimensional specialization of Pinsker's (1980) theorem, given by
Beran and Dümbgen (1998), implies the next theorem on asymptotic minimaxity
of adaptive MS estimators of $\eta$. The proof follows from the discussion
in Section 4 of Beran (2000). Let $\xi_0^2 = \sigma^2[(\gamma/a)^{1/2} - 1]_+$, where $\gamma$ is the
unique positive number such that $\mathrm{ave}(a\xi_0^2) = \sigma^2 r$. Define

(3.2)   $\nu_p(r, a, \sigma^2) = \mathrm{ave}[\sigma^2 \xi_0^2 / (\sigma^2 + \xi_0^2)].$



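Theorem 3.1. Fix the penalty basis $U$ by choice of $D$ or otherwise. For
every $b \in (0, 1]$, every $a \in \mathcal{S}_M(b)$, every $r > 0$ and every $\sigma^2 > 0$,

(3.3)   $\lim_{p \to \infty} \Bigl[ \inf_{\hat\eta} \sup_{\xi \in E(r, a, \sigma^2)} R(\hat\eta, \eta, \sigma^2) / \nu_p(r, a, \sigma^2) \Bigr] = 1.$

The shrinkage-adaptive estimator $\hat\eta_{MS}(D, \hat f_{MS})$ achieves asymptotic minimax
bound (3.3) in that

(3.4)   $\lim_{p \to \infty} \Bigl[ \sup_{\xi \in E(r, a, \sigma^2)} R(\hat\eta_{MS}(D, \hat f_{MS}), \eta, \sigma^2) / \nu_p(r, a, \sigma^2) \Bigr] = 1.$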



What does this theorem tell us? First, note that the asymptotic minimax
risk $\nu_p(r, a, \sigma^2)$ in (3.3) is monotone decreasing in the vector $a$. Thus, if
$\xi = U'\eta \in E(r, a, \sigma^2)$ for relatively small $b$ and relatively large vector $a$ (in
other words, if the basis is economical for expressing $\eta$), then the asymptotic
minimax risk is relatively small compared to the risk $\sigma^2$ of the LS estimator.
Second, (3.4) indicates that the adaptive MS estimator achieves the asymptotic
minimax risk for every degree of basis economy. Even a poor choice
of basis for adaptive MS estimation does not lead to disaster relative to LS
estimation.

A special case of Theorem 3.1 makes both points obvious, albeit in a
simplified setting. Let $B(b) = \{a \in \mathcal{S}_M(b) : a_i = \infty \text{ if } \lfloor bp \rfloor + 1 \le i \le p\}$. In
Theorem 3.1, replacing $a \in \mathcal{S}_M(b)$ with the stronger restriction $a \in B(b)$ and
$\nu_p(r, a, \sigma^2)$ with the evaluation $\sigma^2 r b/(r + b)$ gives a valid statement. In this
simplified setting, basis economy corresponds to a small value of $b$. The
ratio of the asymptotic minimax risk to the risk of the LS estimator is small
whenever $b$ is small; and the adaptive MS estimator is still asymptotically
minimax.
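
The closed form $\sigma^2 r b/(r + b)$ can be checked against the general
definition of the minimax bound. The sketch below assumes the reading
$\xi_0^2 = \sigma^2[(\gamma/a)^{1/2} - 1]_+$ with $\gamma$ solving
$\mathrm{ave}(a\xi_0^2) = \sigma^2 r$, and $\nu_p = \mathrm{ave}[\sigma^2\xi_0^2/(\sigma^2 + \xi_0^2)]$;
it finds $\gamma$ by bisection and treats $a_i\xi_{0,i}^2$ as zero where
$a_i = \infty$. The function name `pinsker_bound` and the numerical values are
illustrative assumptions.

```python
import numpy as np

def pinsker_bound(a, r, sigma2):
    """nu_p(r, a, sigma^2) for a weight vector a with entries in [1, inf]."""
    a = np.asarray(a, dtype=float)
    finite = np.isfinite(a)

    def xi0_sq(gamma):
        out = np.zeros_like(a)   # components with a_i = inf stay at zero
        out[finite] = sigma2 * np.maximum(np.sqrt(gamma / a[finite]) - 1.0, 0.0)
        return out

    def constraint(gamma):
        x = xi0_sq(gamma)
        return np.sum(a[finite] * x[finite]) / a.size - sigma2 * r

    lo, hi = 0.0, 1.0
    while constraint(hi) < 0.0:          # bracket the root
        hi *= 2.0
    for _ in range(200):                 # bisection on the monotone constraint
        mid = 0.5 * (lo + hi)
        if constraint(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    x = xi0_sq(hi)
    return float(np.mean(sigma2 * x / (sigma2 + x)))

q, p = 20, 100                           # b = q / p = 0.2
b, r, sigma2 = q / p, 0.5, 1.0
a_special = np.concatenate([np.ones(q), np.full(p - q, np.inf)])

nu_special = pinsker_bound(a_special, r, sigma2)
closed_form = sigma2 * r * b / (r + b)
nu_uneconomical = pinsker_bound(np.ones(p), r, sigma2)   # b = 1: no economy
```

Under these assumptions the bound for $b = 0.2$ is well below the bound for
$b = 1$, and both are below the LS risk $\sigma^2$.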



3.2. Local annihilators. Difference operators are well-established as penalty
matrices for PLS when the ordinal factor levels $s = (s_1, s_2, \ldots, s_p)$, with
$s_1 < s_2 < \cdots < s_p$, are equally spaced [cf. Press, Teukolsky, Vetterling and
Flannery (1992), Section 18.5]. To define the $d$th difference matrix $D_d$, first
define the $(p - 1) \times p$ matrix $\Delta(p) = \{\delta_{i,j}\}$, in which $\delta_{i,i} = 1$, $\delta_{i,i+1} = -1$ for
every $i$ and all other entries are zero. Then

(3.5)   $D_1 = \Delta(p), \qquad D_d = \Delta(p - d + 1) D_{d-1} \quad \text{for } 2 \le d < p.$

Evidently, the $(p - d) \times p$ matrix $D_d$ annihilates powers of $s$ up to power
$d - 1$ in the sense that

(3.6)   $D_d s^k = 0 \quad \text{for } 0 \le k \le d - 1.$

Here $s^k$ denotes the column vector $(s_1^k, \ldots, s_p^k)'$. Moreover, in row $i$ of $D_d$,
the elements not in columns $i, i + 1, \ldots, i + d$ are zero.
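
The recursion for the difference matrices and their annihilation property can
be checked directly. This is a minimal sketch for equally spaced levels; the
function names `delta` and `difference_matrix` are illustrative and not taken
from the paper.

```python
import numpy as np

def delta(p):
    """(p - 1) x p matrix with entry (i, i) = 1 and entry (i, i + 1) = -1."""
    M = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    M[idx, idx] = 1.0
    M[idx, idx + 1] = -1.0
    return M

def difference_matrix(d, p):
    """D_d built by the recursion D_1 = Delta(p), D_d = Delta(p - d + 1) D_{d-1}."""
    D = delta(p)
    for j in range(2, d + 1):
        D = delta(p - j + 1) @ D
    return D

p = 12
s = np.arange(1.0, p + 1)            # equally spaced factor levels 1, ..., p
D4 = difference_matrix(4, p)

# D_4 annihilates the powers s^0, ..., s^3 but not s^4.
residuals = [np.abs(D4 @ s**k).max() for k in range(4)]
fourth_power = D4 @ s**4
```

Each row of `D4` contains the coefficients (1, -4, 6, -4, 1) in columns
$i, \ldots, i + 4$ and zeros elsewhere.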

Suppose for simplicity that $X = I_p$. Let $U$ be the penalty basis generated
by penalty matrix $D_d$. By the variational characterization of $U$ given in
Section 2.1, the space spanned by the first $d$ columns of $U$ consists of vectors
$v$ that satisfy $D_d v = 0$. When $m$ behaves locally like a polynomial of degree
$d - 1$ and the value of $d$ is modest, then this penalty basis is economical for $\eta$.
Such considerations support the use of difference operators as candidate
penalty matrices when the factor levels are equally spaced.
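
A small numerical illustration of this economy property follows. Section 2.1 is
not reproduced here, so the sketch makes an explicit assumption: for $X = I_p$
the penalty basis is taken to be the orthonormal eigenvectors of $D'D$ ordered
by increasing eigenvalue. The function name `penalty_basis`, the choice $d = 2$
and the linear mean vector are illustrative.

```python
import numpy as np

def penalty_basis(D):
    """Eigenvectors of D'D ordered by increasing eigenvalue (assumed form of U)."""
    eigvals, U = np.linalg.eigh(D.T @ D)   # eigh returns ascending eigenvalues
    return eigvals, U

p, d = 20, 2
# d-th difference matrix; np.diff gives the forward difference, which differs
# from the paper's D_d at most by the sign (-1)^d, so D'D is unchanged.
D = np.diff(np.eye(p), n=d, axis=0)
eigvals, U = penalty_basis(D)

# The first d columns of U span the null space of D_d.
null_residual = np.abs(D @ U[:, :d]).max()

# A mean vector that is a polynomial of degree d - 1 in the factor levels
# has all of its canonical coefficients in the first d components.
s = np.arange(1.0, p + 1) / p
eta = 1.0 + 2.0 * s
xi = U.T @ eta
tail = np.abs(xi[d:]).max()
```

Here the basis is as economical as possible: only the first $d$ components of
$\xi$ are nonzero, up to rounding error.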

When $m$ is expected to behave locally like a polynomial of degree $d - 1$,
but the factor levels in $s$ are not equally spaced, we replace $D_d$ as follows. For
every integer $1 \le d < p$, the local polynomial annihilator $A_d$ is a $(p - d) \times p$
matrix characterized through three conditions. First, for every possible $i$,
all elements in the $i$th row of $A_d$ other than $\{a_{i,j} : i \le j \le i + d\}$ are zero.
Second, $A_d$ satisfies the orthogonality conditions

(3.7)   $A_d s^k = 0 \quad \text{for } 0 \le k \le d - 1.$

Third, each row vector in $A_d$ has unit length. These requirements are met
by setting the nonzero elements in the $i$th row of $A_d$ equal to the basis
vector of degree $d$ in the orthonormal polynomial basis that is defined on
the $d + 1$ design points $(s_i, \ldots, s_{i+d})$. The S-Plus function poly accomplishes
this computation. When the components of $s$ are equally spaced, $A_d$ is just
a scalar multiple of the $d$th difference matrix $D_d$.
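
The three conditions can be met without S-Plus. The sketch below is one
possible construction, not the author's code: it obtains the degree-$d$
orthonormal polynomial on each window of $d + 1$ levels from a QR decomposition
of the Vandermonde matrix. The function name `local_annihilator`, the sign
convention (last nonzero entry of each row positive) and the example levels are
assumptions made for illustration.

```python
import numpy as np

def local_annihilator(s, d):
    """(p - d) x p matrix A_d for strictly increasing factor levels s."""
    s = np.asarray(s, dtype=float)
    p = s.size
    A = np.zeros((p - d, p))
    for i in range(p - d):
        window = s[i:i + d + 1]
        # Centre and scale for numerical stability; this affine change leaves
        # the span of 1, s, ..., s^(d-1) on the window unchanged.
        w = (window - window.mean()) / (window.max() - window.min())
        V = np.vander(w, d + 1, increasing=True)     # columns 1, w, ..., w^d
        Q, _ = np.linalg.qr(V)
        q = Q[:, d]            # unit vector orthogonal to 1, w, ..., w^(d-1)
        if q[-1] < 0:          # fix the sign, which QR leaves arbitrary
            q = -q
        A[i, i:i + d + 1] = q
    return A

# Unequally spaced levels.
s_uneven = np.array([0.0, 0.5, 1.5, 2.0, 3.5, 4.0, 6.0, 6.5, 7.0, 9.0])
A3 = local_annihilator(s_uneven, 3)
residuals = [np.abs(A3 @ s_uneven**k).max() for k in range(3)]
row_norms = np.linalg.norm(A3, axis=1)

# Equally spaced levels: A_4 is a scalar multiple of the fourth difference.
s_even = np.arange(1.0, 13.0)
A4 = local_annihilator(s_even, 4)
D4 = np.diff(np.eye(s_even.size), n=4, axis=0)
```

For equally spaced levels the rows of `A4` are the fourth-difference
coefficients (1, -4, 6, -4, 1) divided by their Euclidean norm $\sqrt{70}$.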

3.3. Adaptive choice of penalty matrix and split. As we have seen, a
penalty basis ideally exploits, through choice of the penalty matrix, informed
conjecture about the function $m$ in (1.1). When this is the case, penalty bases
are often reasonably economical. However, if the prior information is weak
or flawed, some of the higher-order components of $\xi$ may not be negligible.
Soft-thresholding handles possibly isolated higher-order components of $\xi$
that need to be considered in the fit. The choice of dividing point $p_1$ between
monotone shrinkage and soft-thresholding in the MS $\times$ ST HS estimator then
becomes important. We will use the strategy of minimizing estimated risk
to select $D$ and $p_1$, in addition to the shrinkage vectors.

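Given a set $\mathcal{D}$ of candidate penalty matrices, such as $\{A_d : 1 \le d \le k\}$,
we select an empirically best MS estimator as follows. Over shrinkage class
$\mathcal{F}_{MS}$ and over penalty matrix class $\mathcal{D}$, the fully adaptive MS estimator of
$\eta$ is defined to be $\hat\eta_{\mathcal{D},MS} = \hat\eta_{MS}(\hat D, \hat f)$, where

(3.8)   $(\hat D, \hat f) = \mathrm{argmin}_{D \in \mathcal{D},\, f \in \mathcal{F}_{MS}} \hat r(D, f).$

The fully adaptive ST estimator $\hat\eta_{\mathcal{D},ST}$ is defined analogously, replacing $\mathcal{F}_{MS}$
in (3.8) with $\mathcal{F}_{ST}$.

For HS estimators, it is also desirable to explore competing choices of
$p_1 = \lfloor ap \rfloor$, where $\lfloor \cdot \rfloor$ denotes integer part and candidate values of $a$ lie in a
specified subset $\mathcal{A}$ of $[0, 1]$. Over shrinkage class $\mathcal{F}_{HS}$, over penalty matrix
class $\mathcal{D}$ and over split fraction class $\mathcal{A}$, the fully adaptive HS estimator of $\eta$
is defined to be $\hat\eta_{\mathcal{D},\mathcal{A},HS} = \hat\eta_{HS}(\hat D, \hat a, \hat f)$, where

(3.9)   $(\hat D, \hat a, \hat f) = \mathrm{argmin}_{D \in \mathcal{D},\, a \in \mathcal{A},\, f \in \mathcal{F}_{HS}} \hat r(D, a, f).$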

The asymptotics in Section 4 support choosing $a$ and $D$ to minimize
estimated risk provided the cardinalities of $\mathcal{A}$ and of $\mathcal{D}$ grow slowly as $p$
increases. The numerical examples in Sections 2.5 and 2.6 used $\mathcal{D} = \{D_d : 1 \le
d \le 6\}$ and $\mathcal{A} = \{0.05k : 0 \le k \le 20\}$. The asymptotics given do not care
whether the candidate bases are constructed as penalty bases. However,
minimizing estimated risk over a very large class of bases should not be
expected to yield a good estimator of $\eta$. For instance, the MS estimator that
minimizes the estimated risk of $U \mathrm{diag}\{f\} U'y$ over all $f \in \mathcal{F}_{MS}$ and over all
permutations of the columns of a fixed basis matrix $U$ is dominated by the
LS estimator in the saturated model. Remark A on page 1829 of Beran and
Dümbgen (1998) gives a proof. In such cases, the covering numbers used in
the asymptotics of Section 4 are too large for Theorem 4.1 to hold.
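
The selection by minimum estimated risk can be sketched for the simplest case,
the fully adaptive ST estimator over $\mathcal{D} = \{D_d : 1 \le d \le 6\}$.
Several ingredients below are assumptions, because the defining formulas of
Section 2 are not reproduced here: the penalty basis is taken to be the
eigenvectors of $D'D$ in increasing eigenvalue order, the variance $\sigma^2$ is
treated as known, the estimated risk is Stein's unbiased risk estimate for
soft-thresholding, and thresholds are restricted to $[0, \sigma(2\log p)^{1/2}]$.
All function names are hypothetical. The helper `split_points` lists the
candidate dividing points $p_1 = \lfloor ap \rfloor$ for $a = 0.05k$ using
integer arithmetic, which avoids floating-point error in the floor.

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding sgn(z)(|z| - lam)_+ applied componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sure_soft(z, lam, sigma2):
    """Stein's unbiased estimate of the normalized risk of soft(z, lam)."""
    return float(np.mean(sigma2 - 2.0 * sigma2 * (np.abs(z) <= lam)
                         + np.minimum(z**2, lam**2)))

def fully_adaptive_st(y, sigma2, orders=range(1, 7)):
    """Minimize estimated risk jointly over difference order d and threshold."""
    p = y.size
    lam_max = np.sqrt(2.0 * sigma2 * np.log(p))
    best = None
    for d in orders:
        D = np.diff(np.eye(p), n=d, axis=0)
        _, U = np.linalg.eigh(D.T @ D)       # assumed form of the penalty basis
        z = U.T @ y
        abs_z = np.abs(z)
        # The risk estimate is minimized at 0, at lam_max or at some |z_i|.
        cands = np.concatenate(([0.0, lam_max], abs_z[abs_z <= lam_max]))
        risks = [sure_soft(z, lam, sigma2) for lam in cands]
        j = int(np.argmin(risks))
        if best is None or risks[j] < best["risk"]:
            best = {"risk": risks[j], "d": d, "threshold": float(cands[j]),
                    "eta_hat": U @ soft(z, cands[j])}
    return best

def split_points(p, n_steps=20):
    """Dividing points floor(a p) for a = k / n_steps, k = 0, ..., n_steps."""
    return [(k * p) // n_steps for k in range(n_steps + 1)]

rng = np.random.default_rng(0)
p = 60
s = np.arange(1, p + 1) / p
y = np.sin(2.0 * np.pi * s) + 0.5 * rng.standard_normal(p)
fit = fully_adaptive_st(y, sigma2=0.25)
```

Because the threshold 0 reproduces the LS estimator, the selected estimated
risk can never exceed the LS value $\sigma^2$. A full HS search would add an
outer loop over `split_points(p)` and a monotone shrinkage step on the first
$p_1$ components, which is omitted here.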

4. Asymptotics of adaptation. The main purpose of this section is to
analyze the asymptotic loss and risk of the adaptive ST($D$) and HS($D$)
estimators under the saturated Gaussian one-way layout. The results build on
techniques developed by Beran and Dümbgen (1998) for adaptive MS($D$)
estimators. First we show that minimizing estimated risk over shrinkage class
$\mathcal{F}_{MS}$ or $\mathcal{F}_{ST}$ for fixed penalty matrix $D$ succeeds in minimizing risk
asymptotically over that shrinkage class as the dimension $p$ of the regression space
tends to infinity. Moreover, the estimated risk of the adaptive estimator
converges to its actual loss and risk. In this fashion, estimated risks provide a
credible tool for ranking competing shrinkage estimators. Second, we
provide conditions under which simultaneous adaptation over shrinkage class
$\mathcal{F}_{HS}$, over penalty matrix class $\mathcal{D}$ and over split fraction class $\mathcal{A}$ works in
the senses just described. The results require no smoothness assumptions on
the unknown mean vector $\eta$.

4.1. Adaptation works. For any vector $h \in R^p$, let $\|h\| = \max_{1 \le i \le p} |h_i|$.
The generic subscript $\mathcal{F}$ stands for $\mathcal{F}_{MS}$ or $\mathcal{F}_{ST}$ or $\mathcal{F}_{HS}$, according to the
choice of candidate estimator class.
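Theorem 4.1. Let $\mathcal{F}$ be either $\mathcal{F}_{MS}$ or $\mathcal{F}_{ST}$. Suppose that $\hat\sigma^2$ is
consistent in that, for every $c > 0$ and $\sigma^2 > 0$,

(4.1)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E|\hat\sigma^2 - \sigma^2| = 0.$

(a) Let $V(f)$ denote either the loss $L(\hat\eta_{\mathcal{F}}(D, f), \eta)$ or the estimated risk
$\hat r_{\mathcal{F}}(D, f)$. Then, for every penalty matrix $D$, every $c > 0$ and every $\sigma^2 > 0$,

(4.2)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E \sup_{f \in \mathcal{F}} |V(f) - R(\hat\eta_{\mathcal{F}}(D, f), \eta, \sigma^2)| = 0.$

(b) If $\hat f = \mathrm{argmin}_{f \in \mathcal{F}} \hat r_{\mathcal{F}}(D, f)$, then

(4.3)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} \bigl| R(\hat\eta_{\mathcal{F}}(D, \hat f), \eta, \sigma^2) - \min_{f \in \mathcal{F}} R(\hat\eta_{\mathcal{F}}(D, f), \eta, \sigma^2) \bigr| = 0.$

(c) For $W$ equal to either $L(\hat\eta_{\mathcal{F}}(D, \hat f), \eta)$ or $R(\hat\eta_{\mathcal{F}}(D, \hat f), \eta, \sigma^2)$,

(4.4)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E|\hat r_{\mathcal{F}}(D, \hat f) - W| = 0.$

(d) Let $\#\mathcal{D}$ denote the cardinality of $\mathcal{D}$. Convergences (4.2) to (4.4) hold
for the fully adaptive MS estimator $\hat\eta_{\mathcal{D},MS}$ defined through (3.8) if $(\#\mathcal{D}) p^{-1/2}$
and $(\#\mathcal{D}) E|\hat\sigma^2 - \sigma^2|$ both tend to zero as $p \to \infty$. They hold for the fully
adaptive ST estimator if $(\#\mathcal{D}) p^{-1/2} (\log(p))^{1/4}$ and $(\#\mathcal{D}) E|\hat\sigma^2 - \sigma^2|$ both
tend to zero as $p \to \infty$.

(e) Convergences (4.2) to (4.4) hold for the fully adaptive HS estimator
$\hat\eta_{\mathcal{D},\mathcal{A},HS}$ defined in (3.9) if $\max\{\#\mathcal{A}, \#\mathcal{D}\} p^{-1/2} (\log(p))^{1/4}$ and
$\max\{\#\mathcal{A}, \#\mathcal{D}\} E|\hat\sigma^2 - \sigma^2|$ both tend to zero as $p \to \infty$.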

Parts (a)-(c) refer to the case of fixed $D$. By part (a), the loss, risk and
estimated risk of a candidate estimator converge together, uniformly over
$\mathcal{F} = \mathcal{F}_{MS}$ or $\mathcal{F}_{ST}$. This makes the estimated risk of candidate estimators
indexed by $f \in \mathcal{F}$ a trustworthy surrogate for true risk or loss. By part (b),
the risk of the shrinkage-adaptive estimator $\hat\eta_{\mathcal{F}}(D, \hat f)$ converges to that of
the best candidate estimator. Part (c) shows that the loss, risk and plug-in
estimated risk of an adaptive estimator converge together asymptotically.
Part (d) extends these findings to MS and ST estimators that adapt over
both $f$ and $D$. Part (e) does the same for HS estimators that adapt over $f$,
$D$ and $a$.

Condition (4.1) holds for the variance estimator $\hat\sigma^2_{LS}$ if $n - p$ tends to
infinity with $p$. Asymptotic results for other variance estimators are given
in Beran (1996) and Beran and Dümbgen (1998).
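
Part (a) can be illustrated by simulation in the canonical model. The sketch
below is an illustration under stated assumptions, not the paper's experiment:
it works directly with canonical coefficients $z_i = \xi_i + \sigma e_i$, treats
$\sigma^2$ as known, uses Stein's unbiased risk estimate for soft-thresholding
as the estimated risk, and takes the bounded mean components
$\xi_i = 3\cos(i)$. Since the loss and the estimated risk both converge to the
risk uniformly in the threshold, their largest discrepancy over a grid of
thresholds in $[0, \sigma(2\log p)^{1/2}]$ should shrink as $p$ grows. The
function name `sup_discrepancy` is hypothetical.

```python
import numpy as np

def sup_discrepancy(p, rng, sigma=1.0, n_grid=60):
    """max over thresholds t of |estimated risk(t) - loss(t)| for one sample."""
    xi = 3.0 * np.cos(np.arange(p))                 # bounded mean components
    z = xi + sigma * rng.standard_normal(p)
    t_grid = np.linspace(0.0, sigma * np.sqrt(2.0 * np.log(p)), n_grid)
    abs_z = np.abs(z)[:, None]                      # shape (p, 1)
    shrunk = np.sign(z)[:, None] * np.maximum(abs_z - t_grid, 0.0)
    loss = np.mean((shrunk - xi[:, None])**2, axis=0)
    est = np.mean(sigma**2 - 2.0 * sigma**2 * (abs_z <= t_grid)
                  + np.minimum(abs_z, t_grid)**2, axis=0)
    return float(np.max(np.abs(est - loss)))

rng = np.random.default_rng(0)
# Average the discrepancy over 50 replications for a small and a large p.
avg = {p: float(np.mean([sup_discrepancy(p, rng) for _ in range(50)]))
       for p in (100, 2500)}
```

The averaged discrepancy for $p = 2500$ is expected to be several times smaller
than for $p = 100$, in line with a rate of order $p^{-1/2}$ up to a logarithmic
factor.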
4.2. Auxiliary result. The proof of Theorem 4.1 uses techniques from
empirical process theory. Theorem 4.2 below is taken from Beran and Dümbgen
(1998). It follows from standard symmetrization arguments and Pisier's
(1983) form of the chaining lemma [see also Pollard (1990), Sections 2 and 3].
Let $S = \sum_{i=1}^p \phi_i$, where $\phi_1, \phi_2, \ldots, \phi_p$ are independent stochastic processes
on an index set $T$. All have continuous sample paths with respect to some
metric $\rho$ on $T$ such that $(T, \rho)$ is separable. Define a random pseudo-metric
$\hat m$ on $T$ through

(4.5)   $\hat m^2(s, t) = \sum_{i=1}^p [\phi_i(s) - \phi_i(t)]^2.$

For any pseudo-metric $\nu$ on $T$, define the covering numbers

(4.6)   $N(u, T, \nu) = \min\bigl\{\#T_0 : T_0 \subset T,\ \inf_{t_0 \in T_0} \nu(t_0, t) \le u \ \forall t \in T\bigr\}.$



Theorem 4.2. Suppose that $S(t_1) = 0$ for some $t_1 \in T$. Then there
exists a finite constant $C > 0$ such that

(4.7)   $E \sup_{t \in T} |S(t) - ES(t)| \le C\, E \int_0^{D} \log^{1/2}[N(u, T, \hat m)]\, du,$

where $D = \sup_{t \in T} \hat m(t, t_1)$.



4.3. Proof of Theorem 4.1. The portion of Theorem 4.1 that concerns
$\mathcal{F} = \mathcal{F}_{MS}$ follows from results in Section 6 of Beran and Dümbgen (1998).
We continue by proving parts (a)-(c) for $\mathcal{F} = \mathcal{F}_{ST}$. For this discussion of
soft-thresholding, let $T = [0, t_p]$ with $t_p = (2\log(p))^{1/2}$.

(a) Suppose that $V(f) = \hat r_{ST}(D, f)$ for $f \in \mathcal{F}_{ST}$. In view of (2.22) and (4.1),
it suffices to show that

(4.8)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E \sup_{t \in T} |\hat G(t) - G(t)| = 0$

and

(4.9)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E \sup_{t \in T} \Bigl| \int_0^\infty (u^2 \wedge t^2)\, d[\hat G(u) - G(u)] \Bigr| = 0.$



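In Theorem 4.2, take $\phi_i(t) = p^{-1} I(|z_i| \le t)$. Then $S(t) = \hat G(t)$, $\hat m^2(s, t) =
p^{-1}|\hat G(t) - \hat G(s)|$, $t_1 = 0$, $D \le p^{-1/2}$ and

(4.10)   $N(u, T, \hat m) = \min\bigl\{\#T_0 : T_0 \subset T,\ \inf_{t_0 \in T_0} |\hat G(t_0) - \hat G(t)| \le pu^2 \ \forall t \in T\bigr\} \le 1 + (pu^2)^{-1}.$

Then

(4.11)   $\int_0^{D} \log^{1/2}[N(u, T, \hat m)]\, du \le \int_0^{p^{-1/2}} \log^{1/2}[1 + (pu^2)^{-1}]\, du = p^{-1/2} \int_0^1 \log^{1/2}(1 + v^{-2})\, dv.$

Because the rightmost integral is finite, (4.11) and (4.7) imply

(4.12)   $E \sup_{t \in T} |\hat G(t) - G(t)| \le C p^{-1/2}.$

Limit (4.8) follows.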
Next, observe that

(4.13)   $\int_0^\infty (u^2 \wedge t^2)\, d\hat G(u) = p^{-1} \sum_{i=1}^p z_i^2 I(|z_i| \le t) + p^{-1} \sum_{i=1}^p t^2 I(|z_i| > t) = S_1(t) + S_2(t)$, say.

To analyze $S_1(t)$, let $\phi_i(t) = p^{-1} z_i^2 I(|z_i| \le t)$. For any integer $r \ge 1$, let

(4.14)   $a_r = p^{-1} \sum_{i=1}^p |z_i|^r.$

Now, using Cauchy-Schwarz, $\hat m^2(s, t) \le p^{-1} a_8^{1/2} |\hat G(t) - \hat G(s)|^{1/2}$ and $D \le
p^{-1/2} a_8^{1/4}$. By reasoning akin to that in (4.10),

(4.15)   $N(u, T, \hat m) \le 1 + a_8 (p^2 u^4)^{-1}.$

Consequently, by (4.7) and a calculation like that in (4.11),

(4.16)   $E \sup_{t \in T} |S_1(t) - ES_1(t)| \le C p^{-1/2}\, E a_8^{1/4} \int_0^1 \log^{1/2}(1 + v^{-4})\, dv \le C' p^{-1/2}\, E a_8^{1/4}.$
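To analyze $S_2(t)$, let $\phi_i(t) = p^{-1} t^2 I(|z_i| > t)$. If $s < t$,

(4.17)   $p^2 [\phi_i(s) - \phi_i(t)]^2 = [(s^2 - t^2) I(|z_i| > t) + s^2 I(s < |z_i| \le t)]^2$
         $\le 2(s^2 - t^2)^2 I(|z_i| > t) + 2 s^4 I(s < |z_i| \le t)$
         $\le 8 z_i^2 (s - t)^2 I(|z_i| > t) + 2 z_i^4 I(s < |z_i| \le t).$

Similarly for $t < s$. From this and Cauchy-Schwarz, $\hat m^2(s, t) \le \hat m_1^2(s, t) +
\hat m_2^2(s, t)$, where

(4.18)   $\hat m_1^2(s, t) = 8 p^{-1} (s - t)^2 a_4^{1/2} [1 - \hat G(\max(s, t))]^{1/2},$
         $\hat m_2^2(s, t) = 2 p^{-1} a_8^{1/2} |\hat G(s) - \hat G(t)|^{1/2}.$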




By the first line in (4.17), $D \le A p^{-1/2} a_4^{1/2} \le A p^{-1/2} a_8^{1/4}$ for some finite
constant $A$. Moreover,

(4.19)   $N(u, T, \hat m_1) \le 1 + 8^{1/2} a_4^{1/4} t_p (p^{1/2} u)^{-1},$
         $N(u, T, \hat m_2) \le 1 + 4 a_8 (p^2 u^4)^{-1}$

by reasoning similar to that for $S_1(t)$.
Because $\hat m(s, t) \le \hat m_1(s, t) + \hat m_2(s, t)$,

(4.20)   $N(u, T, \hat m) \le 2 \max\{N(u/2, T, \hat m_1),\ N(u/2, T, \hat m_2)\}$

and so

(4.21)   $\int_0^{D} \log^{1/2}[N(u, T, \hat m)]\, du \le 2^{1/2} \int_0^{D} \log^{1/2}[N(u/2, T, \hat m_1)]\, du + 2^{1/2} \int_0^{D} \log^{1/2}[N(u/2, T, \hat m_2)]\, du.$

The expectation of the second integral on the right-hand side is bounded
from above by a constant times $p^{-1/2}$, as in (4.16). The expectation of the
first integral on the right-hand side is bounded from above by a constant
times $p^{-1/2} t_p^{1/2}$. Hence, by Theorem 4.2,

(4.22)   $E \sup_{t \in T} |S_2(t) - ES_2(t)| \le C_1' p^{-1/2} + C_2' p^{-1/2} t_p^{1/2}.$

Limit (4.9) now follows from (4.16) and (4.22). This establishes (4.2) for
$V(f) = \hat r_{ST}(D, f)$.

Next, suppose that $V(f) = L(\hat\eta_{ST}(D, f), \eta) = p^{-1} |\hat\xi_{ST}(D, f) - \xi|^2$ for $f \in \mathcal{F}_{ST}$.
The $i$th component of $\hat\xi_{ST}(D, f)$ is

(4.23)   $\hat\xi_{ST,i}(D, f) = \mathrm{sgn}(z_i)(|z_i| - t)_+ = z_i - (|z_i| \wedge t)\, \mathrm{sgn}(z_i).$

Hence,

(4.24)   $V(f) = p^{-1} \sum_{i=1}^p (z_i - \xi_i)^2 + p^{-1} \sum_{i=1}^p (|z_i| \wedge t)^2 - 2 p^{-1} \sum_{i=1}^p (z_i - \xi_i)(|z_i| \wedge t)\, \mathrm{sgn}(z_i).$

On the right-hand side of this equation, the $L_1$ convergence, uniformly over
$t \ge 0$, of the second term is given by (4.9) and is immediate for the first
term. It remains to verify this mode of convergence for

(4.25)   $p^{-1} \sum_{i=1}^p (z_i - \xi_i)(|z_i| \wedge t)\, \mathrm{sgn}(z_i) = p^{-1} \sum_{i=1}^p (z_i - \xi_i) z_i I(|z_i| \le t) + p^{-1} \sum_{i=1}^p (z_i - \xi_i)\, t\, \mathrm{sgn}(z_i) I(|z_i| > t) = T_1(t) + T_2(t)$, say.

For $i = 1, 2$, the analysis of $T_i(t)$ parallels that given for $S_i(t)$ after (4.13).
Limit (4.2) now follows for $V(f) = L(\hat\eta_{ST}(D, f), \eta)$.
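
The two algebraic facts used in this step, the alternative form of the
soft-threshold in (4.23) and the three-term expansion of the loss in (4.24),
can be confirmed numerically. This is only a sanity check on simulated
canonical coefficients; the function names, the threshold value and the
simulated data are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding sgn(z)(|z| - t)_+ applied componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def loss_terms(z, xi, t):
    """The three averages that appear in the expansion of the loss."""
    clipped = np.minimum(np.abs(z), t)               # |z_i| ^ t
    first = np.mean((z - xi)**2)
    second = np.mean(clipped**2)
    cross = np.mean((z - xi) * clipped * np.sign(z))
    return first, second, cross

rng = np.random.default_rng(1)
xi = rng.uniform(-2.0, 2.0, size=50)                 # true canonical means
z = xi + rng.standard_normal(50)                     # noisy observations
t = 0.8

direct = soft_threshold(z, t)
alternative = z - np.minimum(np.abs(z), t) * np.sign(z)

loss = np.mean((direct - xi)**2)
first, second, cross = loss_terms(z, xi, t)
expanded = first + second - 2.0 * cross
```

Both identities hold exactly for every threshold $t \ge 0$, so `direct` equals
`alternative` and `loss` equals `expanded` up to rounding error.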

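(b) and (c) In analogy to $\hat f = \mathrm{argmin}_{f \in \mathcal{F}} \hat r_{\mathcal{F}}(D, f)$, let $\tilde f = \mathrm{argmin}_{f \in \mathcal{F}} r_{\mathcal{F}}(f, \xi, \sigma^2)$.
Then $\min_{f \in \mathcal{F}} R(\hat\eta_{\mathcal{F}}(D, f), \eta, \sigma^2) = r_{\mathcal{F}}(\tilde f, \xi, \sigma^2)$. We first show that (4.2)
implies

(4.26)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E|T - r_{\mathcal{F}}(\tilde f, \xi, \sigma^2)| = 0,$

where $T$ can be $L(\hat\eta_{\mathcal{F}}(D, \hat f), \eta)$ or $L(\hat\eta_{\mathcal{F}}(D, \tilde f), \eta)$ or $\hat r_{\mathcal{F}}(D, \hat f)$.
Indeed, (4.2) with $V(f) = \hat r_{\mathcal{F}}(D, f)$ entails

(4.27)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E|\hat r_{\mathcal{F}}(D, \hat f) - r_{\mathcal{F}}(\hat f, \xi, \sigma^2)| = 0,$
         $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E|\hat r_{\mathcal{F}}(D, \tilde f) - r_{\mathcal{F}}(\tilde f, \xi, \sigma^2)| = 0.$

Hence, (4.26) holds for $T = \hat r_{\mathcal{F}}(D, \hat f)$ and

(4.28)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E|r_{\mathcal{F}}(\hat f, \xi, \sigma^2) - r_{\mathcal{F}}(\tilde f, \xi, \sigma^2)| = 0.$

On the other hand, (4.2) with $V(f) = L(\hat\eta_{\mathcal{F}}(D, f), \eta)$ gives

(4.29)   $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E|L(\hat\eta_{\mathcal{F}}(D, \hat f), \eta) - r_{\mathcal{F}}(\hat f, \xi, \sigma^2)| = 0,$
         $\lim_{p \to \infty} \sup_{\|\xi\| \le c} E|L(\hat\eta_{\mathcal{F}}(D, \tilde f), \eta) - r_{\mathcal{F}}(\tilde f, \xi, \sigma^2)| = 0.$

These limits, together with (4.28), establish the remaining two cases of (4.26).

The limits (4.3) and (4.4) are immediate consequences of (4.26).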

(d) By Theorem 2.1 of Beran and Dümbgen (1998), limit (4.2) with $\mathcal{F} =
\mathcal{F}_{MS}$ can be strengthened to

(4.30)   $\sup_{\|\xi\| \le c} E \sup_{f \in \mathcal{F}_{MS}} |V(f) - R(\hat\eta_{\mathcal{F}}(D, f), \eta, \sigma^2)| \le C_1 p^{-1/2} + C_2 E|\hat\sigma^2 - \sigma^2|,$

where the $C_i$ are finite constants. The first assertion of part (d) follows.
The arguments above for $\mathcal{F} = \mathcal{F}_{ST}$ imply that

(4.31)   $\sup_{\|\xi\| \le c} E \sup_{f \in \mathcal{F}_{ST}} |V(f) - R(\hat\eta_{\mathcal{F}}(D, f), \eta, \sigma^2)| \le C_1 p^{-1/2} (\log(p))^{1/4} + C_2 E|\hat\sigma^2 - \sigma^2|,$

where the $C_i$ are finite constants. The second assertion of part (d) follows.

(e) Part (e) similarly follows from (4.30) and (4.31).